What Fraction of Images on the Web Contain Text?
Authors
Abstract
Web search engines index text represented in symbolic form. However, it is well known that some of the text on the web is present in the form of images, and search engines do not index the textual content of those images. This raises two questions: (i) What fraction of the images on the web contain text? (ii) What fraction of the text in these images does not also appear in symbolic form on the web page? Answers to these questions will give web users an idea of how much information search engines miss, and will indicate whether Optical Character Recognition should be a standard part of search engine indexing. To answer these questions, we statistically sample the images referenced in the web pages retrieved by a search engine for specific queries and then find the fraction of sampled images that contain text.
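The sampling step described in the abstract can be illustrated with a minimal sketch. It assumes each sampled image has already been labelled by hand as containing text or not, and it reports the sample fraction together with a Wilson score interval. The function name `estimate_fraction`, the choice of the Wilson interval, and the label values are all assumptions made for illustration; the abstract does not say which estimator the authors used, and the numbers below are not results from the paper.

```python
import math


def estimate_fraction(labels, z=1.96):
    """Estimate the share of True labels with a Wilson score interval.

    labels: iterable of booleans, True if the sampled image contains text.
    z: normal quantile for the desired confidence (1.96 for roughly 95%).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    labels = list(labels)
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one sampled image")

    # Point estimate: fraction of sampled images that contain text.
    p = sum(1 for has_text in labels if has_text) / n

    # Wilson score interval, which behaves better than the plain
    # normal approximation when p is near 0 or 1 or n is small.
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half_width, centre + half_width


# Hypothetical labels for 100 sampled images (made up, not data from the paper).
sample = [True] * 17 + [False] * 83
p, lower, upper = estimate_fraction(sample)
print(f"estimated fraction: {p:.3f}, interval: [{lower:.3f}, {upper:.3f}]")
```

The interval narrows as more images are sampled, which is why the size of the sample matters as much as the observed fraction when generalizing to the web as a whole.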
Similar resources
WebSeer: An Image Search Engine for the World Wide Web
Because of the size of the World Wide Web and its inherent lack of structure, finding what one is looking for can be a challenge. In fact, some of the most highly visited Web sites are search engines. However, while Web pages typically contain both text and images, most currently available search engines only index text. This paper describes WebSeer, a system for locating images on the Web. Web...
WebSeer: An Image Search Engine for the World Wide Web
Because of the size of the World Wide Web and its inherent lack of structure, finding what one is looking for can be a challenge. PC-Meter’s March, 1996, survey found that three of the five most visited Web sites were search engines. However, while Web pages typically contain both text and images, all the currently available search engines only index text. This paper describes WebSeer, a system...
Image Compression: Seeing what's not there
The HTML file that contains all the text for this article is about 25,000 bytes. That's less than one of the image files that was also downloaded when you selected this page. Since image files typically are larger than text files and since web pages often contain many images that are transmitted across connections that can be slow, it's helpful to have a way to represent images in a compact for...
Content Based Image Search over the World Wide Web
Most web pages typically contain both images and text. However, most current search engines index documents based on text only. In order to facilitate effective search for images on the web, we need to complement text with the visual content of the images. We often look for images containing specific objects having some particular spatial and topological relations among them. In this paper, we ...
A Comparative Study between the Texts and Images of Abu Bakr and Imam ‘Ali in the Ilkhanid and Timurid Copies of the Jamiʿ al-Tawarikh
Among the Jamiʿ al-Tawarikh manuscripts produced in the Rabʿ-i Rashidi, four copies have survived: one in Arabic and three in Persian. A century after their transcription, all these manuscripts were in the possession of the Timurid ruler, Shahrukh. Since these manuscripts were incomplete, Shahrukh ordered Hafiz-i Abru to complete the Persian copies. In the process of completion of one of these ...
Publication date: 2001